
Add Parquet variant shredding support #332

Merged
CurtHagenlocher merged 6 commits into apache:main from CurtHagenlocher:VariantShredding
Apr 29, 2026
Conversation

@CurtHagenlocher
Contributor

What's Changed

Implements the Parquet variant shredding spec end-to-end in a new Apache.Arrow.Operations.Shredding namespace, alongside minor changes to the base scalar and array types.

Operations.Shredding reader side:

  • ShreddedVariant / ShreddedObject / ShreddedArray ref-struct trio exposing typed columns and residual bytes side-by-side.
  • VariantArrayShreddingExtensions adds GetShreddedVariant(i) and GetLogicalVariantValue(i) on VariantArray.
  • ShredSchema.FromArrowType derives a shredding schema from an Arrow typed_value type, rejecting unsupported types (e.g., uint32, fixed-size-binary(N≠16)).
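The per-row logic behind GetLogicalVariantValue can be sketched as follows. This is an illustrative Python sketch of the variant shredding spec's read path, not the actual C# API: a shredded row carries an optional typed_value and an optional residual value blob, and the logical value prefers the typed column.

```python
def logical_value(typed_value, residual, decode_residual):
    """Recover the logical variant value for one row.

    typed_value:     the strongly typed column's value, or None when this
                     row was not shredded into the typed column
    residual:        the raw variant 'value' bytes, or None
    decode_residual: decoder for the unshredded variant binary encoding
                     (a stand-in here; real decoding is spec-defined)
    """
    if typed_value is not None:
        return typed_value                # row landed in the typed column
    if residual is not None:
        return decode_residual(residual)  # fall back to the residual bytes
    return None                           # both absent: the value is null

# Toy usage with a stand-in decoder:
assert logical_value(42, None, lambda b: b) == 42
assert logical_value(None, b"\x01", lambda b: "decoded") == "decoded"
assert logical_value(None, None, lambda b: b) is None
```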

Operations.Shredding producer side:

  • VariantShredder decomposes a column of VariantValues against a ShredSchema into shared metadata + per-row ShredResults.
  • ShreddedVariantArrayBuilder assembles those into a shredded VariantArray with a typed_value Arrow tree matching the schema.
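The producer side inverts that split. A minimal sketch of what VariantShredder does per row, assuming the shred schema reduces to a single expected type (the real ShredSchema also handles nested objects and arrays):

```python
def shred_row(value, expected_type):
    """Split one variant value into (typed_value, residual).

    A value matching the shred schema's type lands in the typed column;
    anything else stays behind as a residual to be variant-encoded.
    """
    if isinstance(value, expected_type):
        return value, None
    return None, value

# A matching row shreds; a mismatched row falls back to the residual:
assert shred_row(7, int) == (7, None)
assert shred_row("x", int) == (None, "x")
```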

Apache.Arrow changes:

  • VariantExtensionDefinition accepts struct<metadata, value?, typed_value?> layouts in addition to the plain unshredded form.

  • VariantType gains IsShredded / HasValueColumn / HasTypedValueColumn / TypedValueField properties.

  • VariantArray.GetVariantValue and GetVariantReader throw on shredded columns, with an error message pointing to the Operations.Shredding extensions.

  • The public VariantArray(IArrowArray) constructor now infers the VariantType (shredded or not) from the storage shape.

  • Operations gains a project reference to Apache.Arrow; Apache.Arrow does not reference Operations.

Apache.Arrow.Scalars changes:

  • VariantValueWriter.CopyValue(VariantReader source) transcodes a reader into this writer, re-resolving field IDs against the writer's metadata dictionary. Supports cross-dictionary transcoding and multi-source merge-into-one-dictionary workflows.

  • VariantMetadataBuilder.CollectFieldNames(VariantReader source) is the two-pass companion that accumulates source field names into the target metadata builder.
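The shape of that two-pass, cross-dictionary transcode can be sketched in Python. This is a conceptual model, not the C# API: pass 1 (the CollectFieldNames role) accumulates every source field name into the target dictionary, and pass 2 (the CopyValue role) rewrites each field reference against the merged dictionary's IDs.

```python
def collect_field_names(target_names, source_obj):
    """Pass 1: fold one source object's field names into the target
    metadata dictionary, preserving first-seen order."""
    for name in source_obj:
        if name not in target_names:
            target_names.append(name)

def copy_value(target_names, source_obj):
    """Pass 2: re-resolve each field name to its ID in the target
    dictionary, producing the transcoded (id -> value) mapping."""
    return {target_names.index(name): v for name, v in source_obj.items()}

# Merging two sources with different dictionaries into one:
names = []
a = {"x": 1, "y": 2}
b = {"y": 3, "z": 4}
collect_field_names(names, a)
collect_field_names(names, b)
assert names == ["x", "y", "z"]          # shared dictionary after pass 1
assert copy_value(names, b) == {1: 3, 2: 4}  # b's fields re-resolved
```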

Validation:

  • Conformance tests run against the Iceberg shredded-variant corpus in apache/parquet-testing (test/parquet-testing/shredded_variant/). test/shredded_variant_ipc/regen.py converts each case-NNN.parquet to an Arrow IPC file via pyarrow; 137 resulting .arrow files are checked in so CI needs no Python. All 128 valid conformance cases pass; 6 schema-invalid and data-invalid cases are rejected with clear errors; 3 "spec-invalid but permissive" INVALID cases are documented as read-without-throw.
  • Additional round-trip, reader-style, and builder tests were implemented.


Copilot AI left a comment


Pull request overview

Adds end-to-end Parquet shredded-variant support (reader + producer) under Apache.Arrow.Operations.Shredding, with supporting enhancements to Arrow Variant scalar/array APIs and conformance fixtures converted to Arrow IPC for CI.

Changes:

  • Introduces Apache.Arrow.Operations.Shredding types (e.g., ShredType, ShredOptions, and shared helpers) to represent and operate on shredded typed_value layouts.
  • Extends Variant scalar tooling with cross-metadata transcoding support (VariantValueWriter.CopyValue) and a metadata prepass helper (VariantMetadataBuilder.CollectFieldNames).
  • Adds a regeneration script and checks in Arrow IPC fixtures converted from the Parquet shredded-variant corpus.

Reviewed changes

Copilot reviewed 29 out of 166 changed files in this pull request and generated 1 comment.

Summary per file:

  • test/shredded_variant_ipc/regen.py: Script to regenerate Arrow IPC fixtures from the parquet-testing shredded-variant corpus.
  • test/shredded_variant_ipc/case-*.arrow (many files): Checked-in Arrow IPC fixtures generated from the shredded-variant Parquet test corpus.
  • src/Apache.Arrow.Scalars/Variant/VariantValueWriter.cs: Adds CopyValue(VariantReader) to transcode values while re-resolving field IDs against a target metadata dictionary.
  • src/Apache.Arrow.Scalars/Variant/VariantValue.cs: Adds FromDecimal16(SqlDecimal) to preserve Decimal16 intent and support values beyond decimal range.
  • src/Apache.Arrow.Scalars/Variant/VariantMetadataBuilder.cs: Adds CollectFieldNames(VariantReader) for two-pass encode workflows.
  • src/Apache.Arrow.Operations/Shredding/ShreddingHelpers.cs: Adds a shared helper to construct per-row ShreddedVariant slots from element-group structs.
  • src/Apache.Arrow.Operations/Shredding/ShredType.cs: Defines the shredding type system for typed_value columns (primitive + object/array).
  • src/Apache.Arrow.Operations/Shredding/ShredOptions.cs: Adds schema inference tuning options (depth, frequency, type consistency).
  • src/Apache.Arrow.Operations/Apache.Arrow.Operations.csproj: Adds a project reference to Apache.Arrow to support shredding operations over Arrow arrays/types.


SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
}
return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);

Copilot AI Apr 26, 2026


FromDecimal16(SqlDecimal) converts to decimal via value.Value when value.Data[3] == 0. SqlDecimal.Value can still throw for values that aren't representable as System.Decimal (e.g., scale/precision beyond decimal’s limits) even when the magnitude fits in 96 bits. Consider storing the SqlDecimal instance in those cases (or using a try/catch fallback) so Decimal16 materialization can’t unexpectedly overflow.

Suggested change
return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
try
{
return new VariantValue(VariantPrimitiveType.Decimal16, (object)value.Value);
}
catch (OverflowException)
{
SqlDecimal normalized = SqlDecimal.ConvertToPrecScale(value, 38, value.Scale);
return new VariantValue(VariantPrimitiveType.Decimal16, (object)normalized);
}

Contributor Author

@CurtHagenlocher Apr 26, 2026


I think in a followup change I'm going to always store a Decimal16 as a SqlDecimal and vice versa. The current "convert to decimal if it fits" strategy is unnecessarily complicated. Filed #33 to cover this.

{
StructType elementGroupType = (StructType)elementGroup.Data.DataType;
int valueIdx = elementGroupType.GetFieldIndex("value");
int typedIdx = elementGroupType.GetFieldIndex("typed_value");
Contributor Author


These should probably be cached; need to take a second look.

Contributor Author


(Shouldn't affect the public API, so can be done as a followup.)

Contributor

@adamreeve left a comment


I started reviewing this but didn't get very far, so I'll just leave the couple of comments I have for now.

Comment thread test/shredded_variant_ipc/regen.py
Comment thread src/Apache.Arrow.Operations/Shredding/ShreddedArray.cs Outdated
Contributor

@adamreeve left a comment


This all looks good to me thanks Curt, only a few minor comments.

Comment thread src/Apache.Arrow/Arrays/VariantArray.cs Outdated
Comment thread test/Apache.Arrow.Operations.Tests/Shredding/ShreddedVariantArrayBuilderTests.cs Outdated
BinaryArray metadataArr = metadataBuilder.Build(allocator);

// value column: residual bytes (or null).
BinaryArray valueArr = BuildBinaryColumn(rows, allocator);
Contributor


Should we omit the value array if values are fully shredded? Probably fine to add that as an optimisation later though if there's a need for it.

Contributor Author


My concern with doing that is a hypothetical scenario where we're shredding a column in a very large table. We get the values as an IArrowArrayStream instead of an IArrowArray and we run each of the batches through ShredSchemaInferrer, leaving us with a ShredSchema. Now we take a second pass through the IArrowArrayStream and shred the batches, one at a time. Each of the batches will need to conform to the shredded schema and we can't just omit values in one of them without knowing whether or not it can be omitted in all of them.

In short, I think this would require a separate knob based on the bigger picture.
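The constraint described above can be sketched concretely. This is an illustrative Python model of the hypothetical two-pass stream workflow (names like infer_schema are stand-ins, not the actual ShredSchemaInferrer API): the schema is fixed once over all batches, so even a batch whose values are fully shredded must still emit the value column to conform.

```python
def shred_batch(batch, schema):
    """Pass 2: shred one batch against the schema fixed in pass 1.
    Every batch emits both columns, whether or not it needs the residual."""
    typed = [v if isinstance(v, schema) else None for v in batch]
    residual = [v if not isinstance(v, schema) else None for v in batch]
    return typed, residual

batches = [[1, 2], [3, "mixed"]]
schema = int  # pass 1 result: inferred once across every batch
shredded = [shred_batch(b, schema) for b in batches]

# Batch 0 is fully shredded, yet its residual slot still exists, because
# batch 1 needs it and all batches must share one shredded layout:
assert shredded[0] == ([1, 2], [None, None])
assert shredded[1] == ([3, None], [None, "mixed"])
```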

Comment thread test/Apache.Arrow.Operations.Tests/Shredding/ShreddedVariantReaderTests.cs Outdated
@CurtHagenlocher CurtHagenlocher merged commit 8e35d8f into apache:main Apr 29, 2026
14 checks passed